Estimating Interrater Reliability
in Incomplete Designs

The use of various forms of the intraclass correlation
coefficient to estimate interrater reliability has been
addressed in a number of recent articles (Bartko, 1976; Bintig,
1980; Fleiss & Shrout, 1978; Kraemer & Korner, 1976; Saal,
Downey, & Lahey, 1980; Shrout & Fleiss, 1979). While these
articles have focused on "complete designs", where each target
is rated by each judge on one or more dimensions (variables),
several have touched on "incomplete designs" in which each of
K targets is rated by a different set of judges (i.e., judges
are nested within targets) using the same rating variable(s).

A form of intraclass correlation may be used to provide
consistent estimates of interrater reliability for incomplete
designs, the typical question being whether the judges within
each of the K targets agreed with respect to their ratings
(cf. Ebel, 1951; Guilford, 1954; Shrout & Fleiss, 1979; Winer,
1971).

The incomplete design is employed frequently in areas such
as climate research, where $n_k$ employees nested in each of K
(k = 1, ..., K) organizations report perceptions on a climate
variable such as "managerial support" (cf. Insel & Moos, 1974;
James & Jones, 1974). An interrater reliability is computed from
information furnished by a random effects, one-way ANOVA. That
is, each of the K targets (organizations) assumes the role of a
treatment, and ratings (perceptions) provided by the $n_k$ judges
(employees) furnish values on the dependent variable "X". The
$n_k$ need not be equal. A one-way ANOVA is conducted, where a
significant F suggests that variation among the scores on X
was associated more with differences among targets than with
pooled differences among judges nested within targets. An
estimate of interrater reliability is obtained by the following
equation (cf. Ebel, 1951):

\[
\mathrm{ICC} = \frac{MSB - MSW}{MSB + (n_k - 1)\,MSW} \tag{1}
\]

where ICC is an intraclass correlation, MSB is the mean square
for between-groups (targets), MSW is the within-group mean
square, and $n_k$ is a harmonic mean based on the number of judges
per group k. The more conventional term "group" refers to all
judges, or raters, who rated the same target.
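As a rough illustration of Eq. 1, the following Python sketch computes MSB, MSW, and the ICC from raw ratings. The function name `icc_one_way` and the list-of-lists input layout are assumptions made for this example and do not come from the report; the harmonic mean of the group sizes is used for $n_k$, following the definition given above.

```python
from statistics import harmonic_mean, mean


def icc_one_way(groups):
    """Eq. 1 ICC for a list of groups, each a list of ratings of one target.

    Illustrative helper (hypothetical name). Raises ZeroDivisionError when
    every rating in the data set is identical, because MSB and MSW are both 0.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-group and within-group sums of squares for a one-way ANOVA.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)

    msb = ss_between / (k - 1)
    msw = ss_within / (n_total - k)

    # Harmonic mean of the number of judges per group.
    n_k = harmonic_mean([len(g) for g in groups])
    return (msb - msw) / (msb + (n_k - 1) * msw)
```

With perfect within-group agreement and different group means (for example `[[1, 1, 1], [5, 5, 5]]`), MSW is zero and the function returns 1.0; with identical group means and some within-group spread it returns a negative value.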

Interrater reliability is viewed here as a function of
the degree to which raters who rated the same target agreed
with respect to their ratings; that is, high interrater
reliability is indicated by high within-group agreement, and low
interrater reliability is indicated by lack of within-group
agreement. The terms interrater reliability and agreement are
used interchangeably.

The initial objective of this report is to demonstrate
that the ICC above may provide a seriously misleading indicator
of interrater reliability. First, inspection of Eq. 1
demonstrates that as MSW decreases, the ICC increases. Thus, the
ICC is a function, in part, of the extent to which within-group
agreement is present, as shown by the degree to which raters
within each group give the same ratings (Bartko, 1976). Note,
however, that the ICC and MSW are based on pooled data, and
the ICC estimate applies to all groups. If the separate
within-group variances are not homogeneous, then the ICC may
overestimate agreement for some groups, estimate it accurately
for others, and underestimate it for still others. This potential
problem may be checked empirically by a homogeneity of variance
test. Suppose, however, that the null hypothesis of equal
variances is rejected. Does this suggest that interrater
reliability cannot be estimated? Certainly not; it suggests only
that a between-group design should not be used and that a separate
estimate of agreement should be obtained for each group.

A second and more important point is that even with high
agreement among raters in each group, the ICC may be very low.
Consider, for example, a scenario in which (a) the raters in
each one of K groups responded almost exactly the same, which
denotes close to perfect agreement among the ratings for each
target and a low MSW; and (b) the mean scores for all K groups
were essentially identical, in which case MSB is zero, or
approximately so. Inspection of Eq. 1 demonstrates that, given
these conditions, the ICC would be equal to zero, or even negative
in value.

To illustrate, consider the data presented in Table 1.
The data consist of scores on a random variable X, which has
five discrete, equally spaced alternatives, for 20 individuals
in each of two groups. In accordance with assumptions
underlying the use of ANOVA and the ICC (cf. Shrout & Fleiss, 1979;
Winer, 1971), it is presumed that (a) groups (e.g., organizations)
and raters (e.g., employees) were randomly sampled from
populations to which inferences regarding groups and raters are to
be made, (b) raters rated independently, (c) the within-group
residual components are independently and normally distributed
in the population, and (d) the within-group variances are
homogeneous. The response frequencies in Table 1 indicate that
individuals in each group tended to agree. Agreement is also
reflected by the small within-group variances (.211 and .261).
However, not only is the F-test nonsignificant (p > .05), but
the ICC is -.047, which is regarded as .00 (Bartko, 1976).
This low and obviously misleading ICC is attributed to the
essential absence of variation among the group mean ratings
(3.00 and 3.05).
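The reported ICC can be checked from the summary statistics alone. The sketch below is an illustration, not the authors' computation: the helper name `icc_from_summary` is hypothetical, and it assumes equal group sizes (20 raters per group, as stated) and that the reported variances are unbiased sample variances, so that MSW is simply their mean.

```python
def icc_from_summary(group_means, group_variances, n_per_group):
    """Eq. 1 ICC from group means and variances, assuming equal group sizes.

    Hypothetical helper for illustration; variances are assumed to use
    the n - 1 denominator.
    """
    k = len(group_means)
    grand_mean = sum(group_means) / k
    msb = n_per_group * sum((m - grand_mean) ** 2 for m in group_means) / (k - 1)
    msw = sum(group_variances) / k
    return (msb - msw) / (msb + (n_per_group - 1) * msw)


# Summary values reported for Table 1.
table_1_icc = icc_from_summary([3.00, 3.05], [0.211, 0.261], 20)
print(round(table_1_icc, 3))  # prints -0.047
```

Here MSB is only .025 while MSW is .236, so the numerator of Eq. 1 is negative even though both within-group variances are small.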


Insert  Table  1  about  here 


We hasten to note that lack of variation among group mean
ratings does not automatically indicate a misleading ICC. For
example, low variation among means accompanied by high variation
among ratings within groups provides an accurate ICC of
approximately zero. Our concern is limited to conditions of the type
displayed in Table 1, where low variation among group mean
ratings accompanied by low within-group variance results in
inaccurate estimates of agreement. Moreover, we submit that
such conditions are neither unrealistic nor trivial. Consider,
for example, a study in which a different sample of $n_k$ inspectors
(raters) rates each of K airplanes (groups, targets), selected
randomly from those airplanes owned by a particular airline
company. It is not unreasonable to assume that (a) this company
has followed rigorous maintenance standards in the interest of
satisfying safety criteria, and (b) the $n_k$ raters of each airplane
rate that airplane highly in regard to safety. Are we to conclude
that, based on Eq. 1, the inspectors failed to agree with respect
to the safety of the airplanes?

In summary, an ICC based on the one-way, between-group ANOVA
design has potentially serious deficiencies as an estimator
of interrater reliability/agreement. The approach should not
be used if group variances are heterogeneous. Moreover, given
homogeneity of variance, the ICC will be misleading if group
mean ratings do not vary and low variation exists among ratings
within groups. Given either of these situations, a different
method is needed to estimate interrater reliability/agreement.
The second objective of this paper is to propose such a method.

Estimating  Interrater  Reliability 
Using  a  Within-Group  Design 

The proposed method for estimating interrater reliability
is based on a within-group design. A within-group design was
selected because a separate estimate of agreement may be obtained
for each group in an incomplete design, and, of major importance,
this estimate is not affected by either failure to have large
between-group differences or lack of homogeneity of within-group
variances. Furthermore, the estimates for each group may
be averaged to furnish an overall estimate of agreement for all
groups if the homogeneity of variance assumption is satisfied.
As we shall demonstrate, this average may be substantially higher
and an obviously more accurate estimator of interrater reliability
than the ICC in the condition of major concern (i.e., high
within-group agreement and low between-group variation).

Within-group approaches for estimating interrater reliability
are not new (Bintig, 1980; Finn, 1970; Selvage, 1976). However,
some, but not all, methods and logical principles presented here
differ from those of earlier treatments. In addition, Cooper
(1976) and Hsu (1979) presented exact small sample and
approximate large sample tests designed to ascertain if raters within
a group agreed significantly with respect to their ratings on
a single Likert-type item. These articles dealt only with
significance tests and did not furnish a basis for estimating
an interrater reliability coefficient. The present authors are
in agreement with Cohen (1960), who expressed the opinion that
when reliability is of concern, significance is a trivial point
because "one usually expects much more than this (i.e.,
significance) in the way of reliability in psychological measurement"
(p. 44). Consequently, we have devoted our attention to point
estimates of interrater reliability and do not consider
significance tests.

The presentation of the estimating procedure begins with
the variance on a rating variable "X" in one group, or $s_X^2$.
For example, in Table 1, $s_X^2 = .211$ for Group 1. An $s_X^2 = 0$
indicates perfect agreement. Typically, however, $s_X^2 \neq 0$, in
which case the question is the degree to which raters in the
group agreed with respect to their ratings. To develop a
statistic that estimates degree of agreement, it is necessary
to have a standard or benchmark to compare to $s_X^2$. Inasmuch as
$s_X^2 > 0$ reflects departure from perfect agreement, we shall adopt
a benchmark that reflects the expected value of $s_X^2$ in a condition
of absence of agreement. This expected variance is referred to
as $\sigma_E^2$.


Procedures for determining a value of $\sigma_E^2$ for a single
Likert-type scale in a single group have been presented by
Cooper (1976), Finn (1970), Hsu (1979), and Selvage (1976). Finn
and Cooper argued that an interrater reliability of zero occurs
if raters responded randomly to an item. Random responding
implies that each alternative on the rating scale has an equal
likelihood of response. Equal likelihood of response, combined
with assumptions that (a) raters responded independently and (b)
the item X is a discrete random variable with multiple
alternatives arranged on an interval scale, suggests that $\sigma_E^2$ may be
calculated using the equation for the variance of a discrete,
uniform distribution. Specifically, define the item X as a
random variable which assumes A (a = 1, ..., A) finite, equally
spaced alternatives (i.e., A corresponds to the number of
alternatives on X). Equal likelihood of response connotes that
each value a has the same probability of occurrence, or
P(X = a) = 1/A. In other words, the distribution of X is uniform
or rectangular. As shown in a number of statistical texts (cf.
Mood, Graybill, & Boes, 1974), the expected variance of X is
then:

\[
\mathrm{Var}(X) = E(X^2) - [E(X)]^2 = (A^2 - 1)/12 \tag{2}
\]

Equation 2 provides the desired value of $\sigma_E^2$ and is
interpreted as the expected variance of X associated with equal
likelihood of response and zero interrater reliability. A
critical point to be made about $\sigma_E^2$ is that it is a benchmark
connoting an absence of agreement and is to be viewed as a
statistical abstraction. Whether raters would ever respond to
an item in a sheerly random fashion has no bearing on the
appropriateness of relying upon hypothetical random responding
as a statistical referent for assessing the extent to which a
set of actual responses resembles a set of random responses.
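For readers who want the intermediate step behind Eq. 2, the closed form can be recovered from the standard sums of the first A integers and of their squares. The derivation below is a sketch that assumes the alternatives are scored 1, 2, ..., A with equal probability 1/A:

```latex
\begin{align*}
E(X)   &= \frac{1}{A}\sum_{a=1}^{A} a   = \frac{A+1}{2}, \\
E(X^2) &= \frac{1}{A}\sum_{a=1}^{A} a^2 = \frac{(A+1)(2A+1)}{6}, \\
\sigma_E^2 &= \frac{(A+1)(2A+1)}{6} - \frac{(A+1)^2}{4}
            = \frac{(A+1)\,[\,2(2A+1) - 3(A+1)\,]}{12}
            = \frac{(A+1)(A-1)}{12}
            = \frac{A^2 - 1}{12}.
\end{align*}
```

For a five-point scale this gives (25 - 1)/12 = 2.0.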

The benchmark $\sigma_E^2$ is now employed to estimate the variance
in ratings due to nonerror variance and then interrater
reliability. Consider first that an observed score on X, designated
$X_i$ (i = 1, ..., $n_k$ subjects), may be represented as
$X_i = \mu + (\bar{X} - \mu) + e_i$, where $\mu$ and $\bar{X}$ are the
population and sample means on the item, respectively, and $e_i$ is
an error of measurement. The variance of the $X_i$, or $s_X^2$, in a
sample arises only from variation in the $e_i$, and thus $s_X^2$ is
referred to as "error variance". If the $X_i$ are reflective solely
of $\mu$, and thus are entirely devoid of error variance, then
$s_X^2 = 0$. On the other hand, if the $X_i$ are a function of error
exclusively and conform to equal likelihood, random responses,
then $s_X^2 = \sigma_E^2$. This suggests that the extent to which the
$X_i$ are actually reflective of $\mu$, and may be said to reveal
nonerror or true variance, is indicated by $\sigma_E^2 - s_X^2$. The
use of $\sigma_E^2 - s_X^2$ to estimate true variance is a heuristic
designed to "break out" of a closed system in which restrictions
in variances preclude the use of traditional statistical
procedures. Thus, for example, $s_X^2 = 0$ implies that the $X_i$ are
solely a function of $\mu$, which is indicated by setting true
variance equal to $\sigma_E^2 - 0 = \sigma_E^2$. There is, of course, no
such variance (i.e., $\mu$ is a constant), but the heuristic shifts
the basis of analysis to a different logical system, based on
$\sigma_E^2$, in which it is possible to estimate interrater
reliability.

An estimate of interrater reliability is obtained by placing
the estimates of the variances in the equation: (true variance)/
(true variance + error variance), or
$(\sigma_E^2 - s_X^2)/[(\sigma_E^2 - s_X^2) + s_X^2] = (\sigma_E^2 - s_X^2)/\sigma_E^2$.
This equation reduces to the equation suggested by Finn (1970,
p. 72), namely $1 - (s_X^2/\sigma_E^2)$, where $(s_X^2/\sigma_E^2)$
estimates the "proportion of random or error variance present in
the observed ratings", and $1 - (s_X^2/\sigma_E^2)$ "gives the
proportion of non-error variance in the ratings, a reliability
coefficient."

To summarize, the equation for estimating interrater
reliability/agreement is:

\[
r_{WG} = (\sigma_E^2 - s_X^2)/\sigma_E^2 = 1 - (s_X^2/\sigma_E^2) \tag{3}
\]

where:

$r_{WG}$ = within-group interrater reliability for a
single group of raters who have rated the same
target on one discrete, equal interval
variable,

$s_X^2$ = the observed (error) variance on variable X
for the $n_k$ raters in group k,

$\sigma_E^2$ = the variance on X that would be expected if
the raters responded randomly, which is
estimated by $(A^2 - 1)/12$ for a discrete,
uniform distribution (see Eq. 2).

Note that perfect interrater reliability/agreement is
indicated by $s_X^2 = 0$, in which case $r_{WG} = 1.0$. Conversely, equal
likelihood of response connotes zero reliability and no agreement,
in which case $s_X^2 = \sigma_E^2$ and $r_{WG} = 0$. Given the usual
condition in which $0 < s_X^2 < \sigma_E^2$, as $s_X^2$ approaches
$\sigma_E^2$, agreement decreases or, as $s_X^2$ becomes progressively
smaller than $\sigma_E^2$, agreement increases.


The use of Eq. 3 is illustrated by application to the data
in Table 1. With A = 5 (i.e., the item has five alternatives),
$\sigma_E^2 = 2.0$ in each of the two groups [$(5^2 - 1)/12$]. Inserting
the values of $\sigma_E^2$ and the observed variances ($s_X^2$) into Eq. 3
supplies the desired estimates of $r_{WG}$; $r_{WG}$ for Group 1 is .89
[i.e., 1 - (.211/2)], and $r_{WG}$ for Group 2 is .87 [i.e., 1 -
(.261/2)]. Clearly, values of .89 and .87 are different from
the intraclass correlation of .00, and it is just as clear that
the former values are more consistent with the data than the
latter value. Furthermore, given the similarity of the two
values of $r_{WG}$, it is possible to average the values and obtain
an overall estimate of interrater reliability for both groups.
Averaging is not recommended if the values of $r_{WG}$ are dissimilar
for the obvious reason that the average would be misleading for
at least some groups. A homogeneity of variance test on observed
variances might be used to decide whether to average the
coefficients across all groups, or perhaps subsets of groups.
[Given homogeneity of variance, the average $r_{WG}$ may be estimated
by $1 - (MSW/\sigma_E^2)$].
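The computations in this paragraph can be sketched in a few lines of Python. The function names below (`expected_random_variance`, `r_wg`, `r_wg_from_ratings`) are hypothetical labels chosen for this illustration; the last helper additionally assumes that the observed variance is the unbiased sample variance of the raw ratings.

```python
from statistics import variance


def expected_random_variance(num_alternatives):
    """Eq. 2: variance of a discrete uniform distribution over A alternatives."""
    return (num_alternatives ** 2 - 1) / 12


def r_wg(observed_variance, num_alternatives):
    """Eq. 3: within-group agreement for one item in one group."""
    return 1 - observed_variance / expected_random_variance(num_alternatives)


def r_wg_from_ratings(ratings, num_alternatives):
    """Eq. 3 applied to raw ratings (assumes the n - 1 sample variance)."""
    return r_wg(variance(ratings), num_alternatives)


# Observed variances reported for Table 1 (five-point item, A = 5).
group_1 = r_wg(0.211, 5)
group_2 = r_wg(0.261, 5)
# Average coefficient via 1 - MSW / sigma_E^2, with MSW as the mean variance
# (equal group sizes assumed).
pooled = r_wg((0.211 + 0.261) / 2, 5)
print(round(group_1, 2), round(group_2, 2), round(pooled, 2))  # prints 0.89 0.87 0.88
```

Because Eq. 3 is linear in the observed variance, the pooled value computed from MSW equals the simple average of the two group coefficients when group sizes are equal.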

It should also be mentioned that $\sigma_E^2$ is not contingent on
the number of individuals in a group. Eq. 2 indicates that the
expected variance of X given random response and A = 5 will be
2.0 regardless of group size ($n_k$). Moreover, Eq. 2 may be
employed to calculate $\sigma_E^2$ for discrete scales of any length.
For example, if X assumes values of 1 through 4, then
$\sigma_E^2 = 1.25$, while values of 1 through 7 result in
$\sigma_E^2 = 4$. On the other hand, group size has other
implications for the use of Eq. 3, and it is possible to abuse the
use of discrete scales. These points are addressed later in this
paper.

Within-Group  Agreement  on  Composite  Scores 

Data employed in an incomplete design are often based on
a composite score rather than a single item. Within each group,
the composite score takes the form of a sum or a mean per rater
over items designed to measure the same construct. Examples
would be a set of items to be combined to furnish a composite
measure of workgroup morale or team effectiveness in each of K
groups. We will focus here on an estimate of interrater
reliability among raters' composite scores on a set of J
(j = 1, ..., J) items in each of two groups. It is assumed that
(a) the J items are a random sample from a well-defined domain of
items; (b) the $n_k$ raters in each group are randomly sampled from
a population of raters, and inferences will be made to that
population; and (c) the item variances and interitem covariances
are equal, respectively, in the rater population, which implies
that the items are considered to be "essentially parallel"
indicators of the same construct.

An example of the design in question is presented in Table
2, which represents a facsimile of a problem encountered in
research on agreement among performance ratings. The target for
Group 1 is a probationary pilot, rated on knowledge of safety
procedures independently by five senior pilots ($n_k = 5$) on four
items (J = 4) designed to measure safety. Each item employs the
same seven discrete, equally spaced alternatives (A = 7). The
target for Group 2 is a different probationary pilot, rated
independently by a different set of senior pilots ($n_k = 6$) on
the same four safety items. The between-group ICC, based on the
rater composite (mean) scores (shown at the bottom of each data
matrix) is approximately .00, a result of the fact that the mean
composite score for each group is 6.5. Moreover, the
within-group ICC for each group is approximately .00 [cf. Shrout &
Fleiss, 1979, equation for ICC(2,1)]. This is a result of the
fact that the items, essentially parallel indicators of the same
construct, have approximately identical means in each group, from
which it follows that each between-item mean square is close to
zero.


Insert  Table  2  about  here 


Are we to conclude that the senior pilots lacked agreement
in regard to probationary pilots' safety procedures? Certainly
not; the variance (now designated $s_{x_j}^2$) and $r_{WG}$ for each item,
shown in columns to the right of each data matrix, indicate high
levels of agreement. In fact, the average $r_{WG}$, designated
$\bar{r}_{WG}$, is .925 for Group 1 and .93 for Group 2. The separate
$r_{WG}$s were calculated using $\sigma_E^2 = 4.0$ [i.e., $(7^2 - 1)/12$],
and $\bar{r}_{WG}$ may be calculated for each group because the
$s_{x_j}^2$s are the same or similar. An average of the $\bar{r}_{WG}$s
for the two groups may also be estimated (= .93) because of the
similar $s_{x_j}^2$ and mean $s_{x_j}^2$, or $\bar{s}_{x_j}^2$, for each
group. [$\bar{s}_{x_j}^2$ is equal to the within-item mean square in
the ANOVA for each group, and $\bar{r}_{WG}$ may be estimated by
$1 - (\bar{s}_{x_j}^2/\sigma_E^2)$ (Finn, 1970)].
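The bracketed remark rests on the fact that averaging the item-level coefficients is the same as applying Eq. 3 to the mean item variance, because $\sigma_E^2$ is constant across the J items. A short worked step follows, using the mean item variances of .30 (Group 1) and .285 (Group 2) that this report uses for Table 2 in the composite-score computations:

```latex
\begin{align*}
\bar{r}_{WG}
  &= \frac{1}{J}\sum_{j=1}^{J}\left(1 - \frac{s_{x_j}^2}{\sigma_E^2}\right)
   = 1 - \frac{\bar{s}_{x_j}^2}{\sigma_E^2}, \\
\text{Group 1:}\quad \bar{r}_{WG} &= 1 - \frac{.30}{4.0} = .925, \\
\text{Group 2:}\quad \bar{r}_{WG} &= 1 - \frac{.285}{4.0} \approx .93.
\end{align*}
```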

While $\bar{r}_{WG}$ for each group indicates agreement among the
raters, it fails to take into account the "boost" in reliability
to be expected from combining essentially parallel items to form
a composite. That is to say, we would expect the estimate of
agreement based on the composite ratings per rater (i.e., a mean
or a sum taken over items) to be higher than the estimate of
agreement based on the $\bar{r}_{WG}$. This point is illustrated by
deriving an equation to estimate agreement among the composite
(mean) ratings per rater in each group. [Note that this is not the
procedure typically employed in ICC designs, which consists of
estimating interrater reliability among ratee (item) means, based
on aggregation over raters.]

The derivation of an interrater reliability coefficient for
composite scores in one group is predicated on extrapolating from
the logic for one item. The model equation is
$\bar{X}_i = \mu + (\bar{X} - \mu) + \bar{e}_i$, where $\bar{X}_i$ and
$\bar{e}_i$ are the mean observed and error scores for the ith rater
on the J items, respectively, $\mu$ is the population mean
(equivalent for all items), and $\bar{X}$ is the observed grand mean.
As in the case of a single item, variance of the $\bar{X}_i$ scores in
a sample arises only from variation in the $\bar{e}_i$. If the J items
are equivalent (essentially parallel) and are reflective solely
of $\mu$, then the variance of the $\bar{X}_i$ scores will be equal to
zero, implying perfect agreement. Variation in the $\bar{X}_i$ scores
denotes departure from perfect agreement. Given essentially
parallel items, the variance among the $\bar{X}_i$ scores may be
estimated by $J(\bar{s}_{x_j}^2)/J^2 = \bar{s}_{x_j}^2/J$, where
$\bar{s}_{x_j}^2$ is the mean item variance, $J(\bar{s}_{x_j}^2)$
estimates the error variance among sums taken over J items
(Gulliksen, 1950), and division by $J^2$ estimates the error variance
among means. We will refer to this variance as "error variance".
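The step from sums to means can be written out explicitly. The sketch below assumes, as an interpretation of the Gulliksen (1950) result cited above rather than a statement made in the report, that the item errors $e_{ij}$ for a rater are uncorrelated across items and share a common variance estimated by $\bar{s}_{x_j}^2$:

```latex
\begin{align*}
\operatorname{Var}\!\left(\sum_{j=1}^{J} e_{ij}\right)
  &\approx J\,\bar{s}_{x_j}^2, \\
\operatorname{Var}(\bar{e}_i)
  = \operatorname{Var}\!\left(\frac{1}{J}\sum_{j=1}^{J} e_{ij}\right)
  &\approx \frac{J\,\bar{s}_{x_j}^2}{J^2}
   = \frac{\bar{s}_{x_j}^2}{J}.
\end{align*}
```

With J = 4 items and a mean item variance of .30, for example, the estimated error variance among rater means is .30/4 = .075.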

As discussed earlier, the nonerror or "true variance" for
an item is estimated by the heuristic $(\sigma_E^2 - s_{x_j}^2)$, where
$\sigma_E^2$ is employed as a benchmark to indicate the expected
variance of the scores on an item j (or $X_j$) associated with equal
likelihood of response and zero interrater reliability. On J
essentially parallel items, the true variance for each item may be
estimated by $(\sigma_E^2 - \bar{s}_{x_j}^2)$, from which it follows
that the true variance among means taken over items is estimated
by $J^2(\sigma_E^2 - \bar{s}_{x_j}^2)/J^2 = (\sigma_E^2 - \bar{s}_{x_j}^2)$
[Gulliksen, 1950; where $J^2(\sigma_E^2 - \bar{s}_{x_j}^2)$ is the
estimated true variance for sums]. Thus, the estimated true
variance associated with variation among the $\bar{X}_i$ is the same
as that associated with each item.

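The interrater reliability associated with agreement among
the mean scores in a group, designated $r_{WG(\bar{X}_i)}$, may now be
estimated as follows:

\[
r_{WG(\bar{X}_i)} = \frac{\sigma_E^2 - \bar{s}_{x_j}^2}{(\sigma_E^2 - \bar{s}_{x_j}^2) + (\bar{s}_{x_j}^2/J)} \tag{4}
\]

\[
= \frac{J(\sigma_E^2 - \bar{s}_{x_j}^2)}{J(\sigma_E^2 - \bar{s}_{x_j}^2) + \bar{s}_{x_j}^2} \tag{5}
\]
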
Equations 4 or 5 furnish a computing procedure for
$r_{WG(\bar{X}_i)}$. It is also possible to demonstrate that these
equations provide an estimate that is equal to the Spearman-Brown
(SB) prophecy equation applied to $\bar{r}_{WG}$, the correction factor
being the number of items. This equality involves dividing the
numerator and denominator of Eq. 5 by $\sigma_E^2$, which is:

\[
r_{WG(\bar{X}_i)} = \frac{J[1 - (\bar{s}_{x_j}^2/\sigma_E^2)]}{J[1 - (\bar{s}_{x_j}^2/\sigma_E^2)] + (\bar{s}_{x_j}^2/\sigma_E^2)} \tag{6}
\]

where $1 - (\bar{s}_{x_j}^2/\sigma_E^2) = \bar{r}_{WG}$ and
$(\bar{s}_{x_j}^2/\sigma_E^2) = 1 - \bar{r}_{WG}$; thus Eq. 6
reduces to $J(\bar{r}_{WG})/[J(\bar{r}_{WG}) + (1 - \bar{r}_{WG})]$, or

\[
J(\bar{r}_{WG})/[1 + (J - 1)\bar{r}_{WG}], \tag{7}
\]

which is the SB equation.

Applied to the data in Table 2, Eq. 5 (and Eq. 7) provides
the following estimates of agreement for Groups 1 and 2,
respectively:

Group 1: $r_{WG(\bar{X}_i)}$ = 4(4 - .30)/[4(4 - .30) + .30] = .98

Group 2: $r_{WG(\bar{X}_i)}$ = 4(4 - .285)/[4(4 - .285) + .285] = .98

Thus, given that assumptions are satisfied regarding
essentially parallel items, $r_{WG(\bar{X}_i)}$ will exceed $\bar{r}_{WG}$,
unless the latter statistic is .00 or 1.00. Furthermore, the
$r_{WG(\bar{X}_i)}$ may be averaged over the two groups inasmuch as the
$\bar{s}_{x_j}^2$ are similar.
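The Table 2 figures, and the equivalence of Eq. 5 with the Spearman-Brown form in Eq. 7, can be checked numerically. The following Python sketch is illustrative only; the function names are hypothetical, and the inputs are the mean item variances (.30 and .285), A = 7, and J = 4 quoted above.

```python
def expected_random_variance(num_alternatives):
    """Eq. 2: variance of a discrete uniform distribution over A alternatives."""
    return (num_alternatives ** 2 - 1) / 12


def r_wg_composite(mean_item_variance, num_alternatives, num_items):
    """Eq. 5: agreement among rater composite (mean) scores over J items."""
    sigma_e2 = expected_random_variance(num_alternatives)
    true_part = num_items * (sigma_e2 - mean_item_variance)
    return true_part / (true_part + mean_item_variance)


def spearman_brown(mean_r_wg, num_items):
    """Eq. 7: Spearman-Brown step-up of the mean single-item r_WG."""
    return num_items * mean_r_wg / (1 + (num_items - 1) * mean_r_wg)


for label, mean_var in (("Group 1", 0.30), ("Group 2", 0.285)):
    eq5 = r_wg_composite(mean_var, 7, 4)
    mean_r_wg = 1 - mean_var / expected_random_variance(7)
    eq7 = spearman_brown(mean_r_wg, 4)
    print(label, round(eq5, 2), round(eq7, 2))
# prints:
# Group 1 0.98 0.98
# Group 2 0.98 0.98
```

The two routes agree to floating-point precision, which is what the algebra from Eq. 5 to Eq. 7 implies.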

Finn (1970) also addressed within-group interrater reliability
for a set of items on which item means were essentially equal,
and recommended the use of Eq. 7 (the SB equation) to estimate
reliability for items, where $\bar{r}_{WG}$ was interpreted as the "mean
reliability per item." Finn did not, however, furnish statistical
justification (or derivation) for the SB equation, including,
in particular, the requirement that the items be essentially
parallel indicators of the same construct. This is a critical
requirement because it justifies the derivation of Eqs. 5 and 7
and suggests that the composite scores are interpretable in
reference to an underlying construct. Moreover, the procedures
apply to aggregation over items, and not raters, a point confused
by Bintig (1980). That is, in a review of the Finn (1970) paper,
Bintig interpreted the Finn procedure as an estimator of
interrater reliability for aggregates taken over raters for each
ratee (items in this paper). It might also be noted that Bintig
used an erroneous estimate of $\sigma_E^2$ (i.e., a value of 3.5 was
used for a seven-point scale, which applies to neither discrete
nor continuous scales).

In conclusion, $r_{WG(\bar{X}_i)}$ is applicable in incomplete designs
when (a) items on which composites (per rater) are based are
essentially parallel indicators of the same construct; (b) the
mean composite score for each group is approximately the same,
and (c) little variation exists among raters in each group. It
is possible, of course, for $\bar{s}_{x_j}^2$, and therefore
$r_{WG(\bar{X}_i)}$, to vary as a function of group, in which case the
$r_{WG(\bar{X}_i)}$ should be interpreted, and reported, separately for
each group. On the other hand, the $\bar{s}_{x_j}^2$ and
$r_{WG(\bar{X}_i)}$ may be similar over groups, which can be tested by
a homogeneity of variance test on the $\bar{s}_{x_j}^2$ (the
within-item mean squares). Given similarity, the $r_{WG(\bar{X}_i)}$ can
be averaged over groups. [If the decision is to average the
$r_{WG(\bar{X}_i)}$ and the $n_k$ differ, there would be little reason
to weight the $r_{WG(\bar{X}_i)}$ by $n_k$ because the
$r_{WG(\bar{X}_i)}$ are similar. The same argument applies to
$r_{WG}$]. In other words, the reasons for using $r_{WG(\bar{X}_i)}$
rather than an ICC approach in incomplete designs are
the same as those for using $r_{WG}$.

Discussion 

In regard to incomplete designs, it has been demonstrated
that $r_{WG}$ and $r_{WG(\bar{X}_i)}$ provide more accurate estimates of
interrater reliability/agreement than an ICC when within-group
variance is small and differences among group means on an item or
a composite (per rater) are essentially nonexistent. In effect,
$r_{WG}$ and $r_{WG(\bar{X}_i)}$ furnish an alternative source of
estimation when the range of values selected by raters on an item,
or on a set of items, is restricted for at least some groups.
Moreover, unlike the ICC, calculation of $r_{WG}$ and
$r_{WG(\bar{X}_i)}$ is not dependent on homogeneity of within-group
variances, and thus separate estimates of interrater
reliability/agreement may be calculated for each group in the
absence of such homogeneity. On the other hand, if variances are
homogeneous, then the estimates may be averaged over groups to
provide an overall estimate of agreement.

Given homogeneity of within-group variances, $r_{WG}$ and $r_{WG(X.)}$
will lose their advantage over the ICC for incomplete designs
as (a) within-group variances on an item or composite score increase,
or (b) the within-group variances remain small but differences
among the mean group item/composite scores increase (i.e., MSB
in Eq. 1 increases in value).¹ The latter point is of major
concern because it raises the question of when to use the methods
suggested here versus an ICC approach, given that at least some
variation exists among group means. Future research is needed
to answer this question, where, for example, a Monte Carlo study
would help to clarify the conditions (e.g., degree of variation
among group means, in relation to the magnitudes of $s_x^2$, $\sigma_E^2$,
and $n_k$) which determine variation among within-group coefficients
and  ICCs  in  nonobvious  situations.  A  Monte  Carlo  study  is  not 
attempted  here,  although  a  brief  illustration  of  point  "b"  above 
is  presented  using  the  data  in  Tables  3  and  4.  Table  3  has  the 
same  pattern  of  ratings  as  Table  1  (i.e.,  low  within-group 
variances);  however,  a  moderate  difference  in  group  means  (1.05 
scale  points)  was  introduced  by  adding  a  constant  of  1.0  to  the 
scores  in  Group  2.  The  resulting  ICC  is  .70,  which  compares  much 



more  favorably  than  the  ICC  of  .00  (Table  1)  to  the  average 
(over groups) $r_{WG}$ of .88. Table 4 again has the same pattern
of  ratings  as  Table  1,  but  a  large  difference  in  group  means 
(2.05  scale  points),  achieved  by  adding  and  subtracting  constants. 
The  ICC  in  Table  4  is  .90,  which  is  slightly  larger  than  the 
average $r_{WG}$ of .88.
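
The three ICC values in this illustration can be checked with a short sketch. It assumes equal group sizes and the form of Eq. 1 shown in the Table 1 computation, (MSB - MSW) / [MSB + (n - 1) MSW]; the helper names are illustrative and not part of the original procedure.

```python
from statistics import mean


def expand(freqs):
    """Turn a {score: frequency} table into a flat list of ratings."""
    return [score for score, count in freqs.items() for _ in range(count)]


def icc_incomplete(groups):
    """ICC from a one-way ANOVA with equal n per group (form assumed from Table 1)."""
    k = len(groups)
    n = len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes equal group sizes")
    grand_mean = mean([x for g in groups for x in g])
    ss_between = sum(n * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (k * (n - 1))
    return (ms_between - ms_within) / (ms_between + (n - 1) * ms_within)


group1 = expand({2: 2, 3: 16, 4: 2})   # Table 1, Group 1
group2 = expand({2: 2, 3: 15, 4: 3})   # Table 1, Group 2

table1 = [group1, group2]
table3 = [group1, [x + 1 for x in group2]]                    # constant added to Group 2
table4 = [[x - 1 for x in group1], [x + 1 for x in group2]]   # constants added and subtracted

icc1, icc3, icc4 = (icc_incomplete(t) for t in (table1, table3, table4))
print(round(icc1, 3), round(icc3, 2), round(icc4, 2))
```

Because adding a constant leaves the within-group variances untouched, only MSB changes across the three tables, which is why the ICC moves while the average $r_{WG}$ stays at .88.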

Insert  Tables  3  and  4  about  here 


The  preceding  example  is  illustrative  of  the  course  of  action 
suggested  for  incomplete  designs  at  the  present  time.  First,  if 
within-group variances appear nonhomogeneous, then conduct a
homogeneity of variance test. If homogeneity is rejected, then
employ $r_{WG}$ or $r_{WG(X.)}$ to estimate interrater reliability for each
group and do not average estimates over groups (at least over
nonhomogeneous groups). If homogeneity is not rejected, then
compute both an ICC (for an incomplete design) and an average
$r_{WG}$ or $r_{WG(X.)}$ over groups. If the estimates differ, then review
the  raw  data  matrix  and  summary  statistics  (e.g.,  group  means, 
within-group  variances)  in  relation  to  the  two  estimates  and 
ascertain  which  estimate  appears  to  provide  the  more  accurate 
point  estimate.  Finally,  report  both  estimates  and  the  rationale 
for  selecting  one  as  more  accurate. 
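
The course of action above can be outlined as a small decision routine. This is only a sketch under stated assumptions: the outcome of the homogeneity of variance test is passed in as a flag rather than computed, and the `tolerance` used to decide whether two estimates "differ" is a hypothetical threshold that the text does not specify.

```python
def recommend(homogeneity_rejected, icc, r_wg_by_group, tolerance=0.05):
    """Return what to report, following the suggested steps.

    `tolerance` is a hypothetical cutoff for "the estimates differ".
    """
    if homogeneity_rejected:
        # Heterogeneous variances: per-group estimates only, no averaging.
        return {"action": "report r_WG separately for each group",
                "r_wg_by_group": list(r_wg_by_group)}

    average_r_wg = sum(r_wg_by_group) / len(r_wg_by_group)
    result = {"icc": icc, "average_r_wg": average_r_wg}
    if abs(icc - average_r_wg) > tolerance:
        result["action"] = ("estimates differ: review raw data and summary "
                            "statistics, report both with a rationale")
    else:
        result["action"] = "estimates agree: report both"
    return result


# Table 1 values: ICC of .00 versus per-group r_WG of .89 and .87.
table1_decision = recommend(False, 0.00, [0.89, 0.87])
# Table 4 values: ICC of .90 versus the same per-group r_WG.
table4_decision = recommend(False, 0.90, [0.89, 0.87])
print(table1_decision["action"])
print(table4_decision["action"])
```

The review step itself remains a matter of judgment; the routine only flags when that review is called for.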

This article is concluded with a brief discussion of concerns
and potential problems regarding the use of $r_{WG}$ (the discussion
applies to $r_{WG(X.)}$). Selvage (1976) and Hsu (1979) argued that
the theoretical distribution of X employed in the calculation
of $\sigma_E^2$ (Eq. 2) should be thought of as normal rather than
rectangular. This argument, however, misses the point made earlier
that $\sigma_E^2$ is a theoretical benchmark used to indicate equal
likelihood of response and zero reliability, and makes possible an
assessment of the degree to which actual responses resemble random
responses. This benchmark is lost if the theoretical distribution
is assumed normal for the simple reason that a normal distribution
already reflects partial agreement (i.e., there are more scores
clustered about the mean than in the tails of the distribution).
It would appear unwise to employ a theoretical benchmark for lack
of agreement that already reflects partial agreement. Consequently,
the use of a rectangular distribution for item distributions
is recommended.²

Selvage (1976, p. 606) argued further that although raters
might use only five (or six, etc.) points on a rating (item)
scale, the points "are only representative of possible values
along the continuum from one to five." This implies that the
distribution underlying the random variable X should be regarded
as continuous (i.e., represents an infinite number of values)
rather than discrete in the calculation of $\sigma_E^2$. This argument
has validity, but it is also the case that an argument can be
made for discrete scales. For example, one could argue that the
alternatives in a five to nine point scale encompass sufficiently
the degrees (categories) of cognitive differentiation/sensitivity
used by most individuals. On the other hand, a continuous scale
may be advisable in some cases, and the reader is referred to
the Selvage paper for statistical procedures to estimate $\sigma_E^2$.

Additional concerns include bias in the estimate of $r_{WG}$,
estimates of less than zero, and artificial manipulation of
estimates by unrealistic measurement scales. In regard to bias,
$r_{WG}$ may be thought of as a function of two unbiased values;
$s_x^2$ is an unbiased estimate of $\sigma_x^2$ for observed values on X, and
$\sigma_E^2$ is a population parameter. Nevertheless, a ratio of unbiased
values (i.e., $s_x^2/\sigma_E^2$) is itself biased (Winer, 1971). However,
like the ICC in Eq. 1, which is biased for the same reason, the
bias in $r_{WG}$ is expected to be minimal for small $n_k$ and essentially
negligible for large $n_k$.

It is possible for $r_{WG}$ to assume values of less than zero.
In fact, a number of theoretical distributions of observed ratings
result in an $s_x^2$ greater than $\sigma_E^2$, and thus a negative $r_{WG}$.
For example, given one item with five alternatives and $n_k$ = 10, if five
raters selected alternative 1 and five raters selected alternative 5,
$r_{WG}$ would be equal to -1.22 [i.e., 1 - (4.44/2.0)]. However, every
distribution of observed X that could result in a negative $r_{WG}$
would reflect rather serious degrees of disagreement. Consequently,
it is recommended that negative estimates be set equal to
zero to indicate lack of agreement among raters.
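
The negative estimate in this example, and the recommended truncation at zero, can be reproduced with a few lines. The sketch assumes the single-item form $r_{WG} = 1 - (s_x^2/\sigma_E^2)$ and a $\sigma_E^2$ of $(A^2 - 1)/12$ = 2.0 for a five-point scale; the variable names are illustrative.

```python
from statistics import variance

# Five raters choose alternative 1 and five choose alternative 5.
ratings = [1] * 5 + [5] * 5

sigma_e_sq = (5 ** 2 - 1) / 12        # 2.0 for a five-point scale (assumed form)
s_x_sq = variance(ratings)            # unbiased estimate, n - 1 denominator

raw_r_wg = 1 - s_x_sq / sigma_e_sq    # negative under bimodal disagreement
reported_r_wg = max(0.0, raw_r_wg)    # negative estimates are set equal to zero

print(round(s_x_sq, 2), round(raw_r_wg, 2), reported_r_wg)
```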

Of special importance is the fact that it is possible to
manipulate artificially the value of $r_{WG}$ by constructing unrealistic
measurement scales. Suppose, for example, that an individual
constructed a meaningful seven-point scale that encompassed
all likely responses. Suppose further that this individual added
three spurious alternatives to each end of the scale; that is,
alternatives with a zero base-rate (e.g., a teacher evaluation
such as: This teacher has never made even the most trivial
mistake). We now have a 13-point scale, resulting in a $\sigma_E^2$ =
14, when in fact the true scale with seven points should have
a $\sigma_E^2$ = 4. Finally, suppose that the distribution of observed
values on X is uniform on the true seven-point scale, which suggests
an interrater reliability of zero. The estimate obtained from the
13-point scale is not, however, zero. For example, with $n_k$ = 21 and
$s_x^2$ = 4.2, rather than the accurate $r_{WG}$ = .00 [i.e., 1 - (4.20/4.0)
= -.05, set equal to zero], $r_{WG}$ is .70 [i.e., 1 - (4.20/14.0)]. This
is the result of artificially adding six spurious alternatives to the scale.
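
A brief sketch of this manipulation follows. It assumes $\sigma_E^2 = (A^2 - 1)/12$ for a discrete rectangular distribution over A alternatives, which gives the values of 4 and 14 cited for the seven-point and 13-point scales. The ratings are coded 1 to 7 for convenience; relabeling them as the middle seven points of the padded scale would not change their variance.

```python
from statistics import variance


def sigma_e_sq(n_alternatives):
    """Variance of a discrete rectangular distribution on A points (assumed form)."""
    return (n_alternatives ** 2 - 1) / 12


# 21 raters spread uniformly over the seven meaningful alternatives (3 per point).
ratings = [score for score in range(1, 8) for _ in range(3)]
s_x_sq = variance(ratings)

# Benchmark from the true seven-point scale; negative values are set to zero.
r_wg_true = max(0.0, 1 - s_x_sq / sigma_e_sq(7))
# Benchmark inflated by six spurious alternatives (13-point scale).
r_wg_padded = 1 - s_x_sq / sigma_e_sq(13)

print(round(s_x_sq, 2), r_wg_true, round(r_wg_padded, 2))
```

The observed variance is identical in both calculations; only the benchmark variance changes, which is what inflates the estimate.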

A  different  problem  occurs  with  using  too  short  a  scale, 
where,  for  example,  the  observed  distribution  on  a  three-point 
scale  could  appear  approximately  uniform.  On  a  longer  but  mean¬ 
ingful  scale  (e.g.,  seven  points),  the  scores  might  spread,  but 
the  locus  of  points  could  remain  dense  within  the  original  three 
points. These conditions imply that the $r_{WG}$ for the three-point
scale  will  be  artificially  low.  In  general,  it  would  seem  that 
too  short  a  scale  would  be  a  result  of  poor  research  practice 
rather  than  artificial  manipulation,  although  the  latter  condition 
is  a  possibility  if  a  vested  interest  existed  in  obtaining  a  low 
interrater  reliability. 


In  conclusion,  use  of  the  procedures  discussed  here  rests 
on the assumption that the measurement scale is meaningful. This
does not suggest that all points on the scale have to be used
in  every  sample;  it  suggests  only  that  the  scale  is  sensitive 
to,  and  limited  to,  psychometrically  reliable  differentiation 
on  the  measured  attribute.  Valid  scaling  procedures  in  conjunction 
with professional and ethical judgment should satisfy this criterion.

¹All references to the ICC approach in this section refer
to the ICC for incomplete designs (Eq. 1). It is assumed that
lack of variation among item means would generally preclude the
use of the within-group ICC.

²The underlying theoretical distribution for rater composite
scores in Eqs. 5 and 7 is normal, a result of the central limit
theorem. This does not detract from the fact that equal likelihood
of response on each item implies that one begin with a
rectangular distribution for each item.

Table 1

Intraclass Correlation for Twenty Raters Nested in Each of Two Groups

Scale for     Frequencies of Scores   Frequencies of Scores
Variable X    in Group 1              in Group 2

1              0                       0
2              2                       2
3             16                      15
4              2                       3
5              0                       0

Mean           3.00                    3.05
Variance        .211                    .261

Analysis of Variance

Source            df    SS       MS
Between-Groups     1     .025     .025    F = .106 (NS)
Within-Groups     38    8.959     .236

Intraclass Correlation = (.025 - .236) / [.025 + (19)(.236)] = -.047 ≈ .00

Note: NS = not significant at p < .05.

Table 3

Comparison of Interrater Reliabilities for Two Groups with Moderate Mean Differences

Scale for     Frequencies of       Frequencies of
Variable X    Scores in Group 1    Scores in Group 2

1              0                    0
2              2                    0
3             16                    2
4              2                   15
5              0                    3

Mean           3.00                 4.05
Variance        .211                 .261
$r_{WG}$¹       .89                  .87
ICC                       .70

Analysis of Variance

Source            df    SS        MS
Between-Groups     1    11.025    11.025    F = 46.79*
Within-Groups     38     8.959      .236

* p < .01

¹Interrater reliability based on 1 - ($s_x^2/\sigma_E^2$), where $\sigma_E^2$ = 2.0.


Table 4

Comparison of Interrater Reliabilities for Two Groups with Large Mean Differences

Scale for     Frequencies of       Frequencies of
Variable X    Scores in Group 1    Scores in Group 2

1              2                    0
2             16                    0
3              2                    2
4              0                   15
5              0                    3

Mean           2.00                 4.05
Variance        .211                 .261
$r_{WG}$¹       .89                  .87
ICC                       .90

Analysis of Variance

Source            df    SS        MS
Between-Groups     1    42.025    42.025    F = 178.43*
Within-Groups     38     8.959      .236

* p < .01

¹Interrater reliability based on 1 - ($s_x^2/\sigma_E^2$), where $\sigma_E^2$ = 2.0.


